Occam's razor meets WMAP
Using a variety of quantitative implementations of Occam's razor we examine the low quadrupole,
the "axis of evil" effect and other detections recently made appealing to the excellent WMAP data. 
We find that some razors fully demolish the much lauded claims for departures from scale-invariance. 
They all reduce to pathetic levels the evidence for a low quadrupole (or any other low-$\ell$ cut-off),
both in the first and third year WMAP releases. The "axis of evil" effect is the only anomaly 
examined here that survives the humiliations of Occam's razor, and even then in the category of 
"strong" rather than "decisive" evidence. Statistical considerations aside, differences between the 
various renditions of the datasets remain worrying. 



I. INTRODUCTION 

A better fit to the data can always be obtained by 
appealing to a theory containing more free parameters. 
The extra knobs can't harm, and quite often help the 
job of fitting data. Intellectual honesty, however, tells 
us that a better fit may then not signal evidence for the 
theory, but merely unfair advantage over its competitors. 
Confronted with two theories fitting the data equally well 
we'd prefer the simpler one, the theory containing fewer 
parameters or based on a less complicated model. 

Such considerations form the basis of Occam's razor, 
but a quantitative formulation is notoriously hard to 
come by. It's clear that the real "evidence" should combine
the naive goodness of fit with a penalty function
measuring the complexity of the theory. But several distinct
rationales for doing this may be found in the literature,
notably the Akaike [1] and Bayesian [2] information
criteria (AIC and BIC) and the Turing machine based
criterion proposed by one of us [3]. Simplicity, it seems,
is in the eye of the beholder. 

Furthermore, subjective double standards seep into 
the analysis, and the rigors of penalization are often 
reserved to results one doesn't like. For example, the 
CMB community has resisted applying Occam's razor to 
inflationary parameters (see jj, H, |^ for notable excep- 
tions) and to some power spectrum features [3,ll3; but 
with reference to anomalies unpalatable to just about 
everyone (such as the "axis of evil" effect, the embar- 
rassing statistical anisotropy exhibited on the largest an- 
gular scales |lll[l3,[i3)j the strictest penalization is en- 
forced jlSj . (The criterion employed therein to scrutinize 
the axis of evil effect is loosely the AIC.) 

We applaud this type of application of Occam's razor, 
but we believe it should be employed impartially. The 
purpose of the present paper is to examine some of the 
proposed Occam razors, and to apply them democrati- 
cally to both "likable" and "undesirable" features in the 
large-angle CMB anisotropy. We examine the WMAP
data [i^ lia, [i3 , in its first and third year releases, and 
in various renditions dealing differently with the galactic 



plane. We focus on claims for departures from scale invariance
and for reionization (Section II), the evidence for
a low quadrupole and a low-$\ell$ power cut-off (Section III),
and the strength of the detection of the so-called axis of
evil effect (Section IV).



II. BRANDS OF OCCAM'S RAZOR 

We first review some well-known criteria for evidence,
adopting a notation similar to that of [3]. Let $\mathcal{L}$ be the
likelihood and $k$ the number of parameters of the model.
They will be tuned so as to maximize the likelihood or,
equivalently, minimize the information $I$. The information
in the data given the theory is defined as minus the
logarithm of the likelihood. But in fact we want to minimize
the information in the data and the theory together,
that is:



$$ I(D,T) = I(D|T) + I(T) \tag{1} $$

so that $I(T)$ is the penalty referred to above.

According to some authorities, strong evidence for a 
theory over a "base model" requires an improvement in 
$I(D,T)$ by at least 3 (see |l|i|). The title of "decisive evidence"
is not normally bestowed unless the improvement
exceeds 5. 

All the razors we will wield fit into the above scheme,
but they differ in how they define $I(T)$. According to the
Akaike information criterion (AIC) the information in a
theory is simply its number of parameters, so that

$$ I_A(D,T) = -\ln\mathcal{L} + k. \tag{2} $$



This is obtained by an approximate minimization of the
Kullback-Leibler information entropy.

Rather different is the Bayesian information criterion
(BIC), based on the penalty

$$ I_B(T) = \frac{k}{2} \ln N \tag{3} $$

where $N$ is the number of data points being fit. It results
from an approximation to the true Bayesian evidence,
giving the model a uniform prior. The full Bayesian evidence,
where one integrates the likelihood over the full
set of parameters, has also been considered.

The criterion developed in [3] interprets $I(T)$ entropically
and algorithmically. It estimates the information in
a theory $T$ in terms of the number of bits of "memory"
required to store the parameters. From this point of view,
a theory's complexity depends not only on how many parameters
it contains, but also on the precision with which
they are stored. The resulting penalty term $I(T)$ is not
simply a function of $N$ and $k$ but depends on the details
of the theory. Typically it includes a term equal to (3) as
well as other terms that ensure that a theory with more
parameters than data points will never be judged a good
fit. The advantage this approach has over AIC and BIC
is that it never has to appeal to the asymptotic approximation
$N \gg 1$. Its disadvantage is that it is harder
to apply since $I(T)$ is not a simple universal function of
$N$ and $k$. In the spirit of the abbreviations "AIC" and
"BIC" we will refer to this third razor as "HIC" or simply
"$H$", since the differential goodness of fit is denoted
by $H$ in [3].

The various criteria do not always agree, even qualitatively.
Take the statements that WMAP displays strong
evidence for reionization and a spectrum of density fluctuation
that is not scale invariant Jl5|. Both assertions
rely on an improvement to the fit $\Delta I(D|T) \approx \Delta\chi^2/2 = -4$
(see Table 3 in [l3|) and for both, this costs an extra
parameter. Using AIC we get $\Delta I(D,T) = -3$, pointing
toward a detection. But $N$ is between 1500 and 3100, so
$\ln N$ is around 8. Using BIC this implies $\Delta I(D,T) \approx 0$,
most definitely not a detection. We have not worked out
the HIC value of $I$, but we would expect it to resemble
BIC more than AIC, leading again to the verdict of
"no detection". Of course one cannot drop both departures
from scale invariance and reionization, and there
are strong astrophysical reasons for preferring reionization
to a tilted spectrum. Therefore based on WMAP it
seems prudent to say that there is no strong evidence for
$n_s \neq 1$.
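The arithmetic of this comparison can be sketched in a few lines of Python. This is only an illustration: the function name `delta_information` is hypothetical, the AIC penalty is taken to be one unit per extra parameter, and the BIC penalty is assumed to be (k/2) ln N, which is the form consistent with the numbers quoted in the text. Negative values mean an improvement over the base model.

```python
import math

def delta_information(delta_fit: float, extra_params: int, n_data=None) -> float:
    """Change in I(D,T): AIC-style if n_data is None, else BIC-style.

    delta_fit is the change in I(D|T) (negative when the fit improves).
    The BIC penalty (k/2) ln N is an assumed form, see the lead-in.
    """
    if n_data is None:
        return delta_fit + extra_params
    return delta_fit + 0.5 * extra_params * math.log(n_data)

delta_fit = -4.0  # improvement in I(D|T), roughly delta chi^2 / 2
print("AIC:", delta_information(delta_fit, 1))
for n_data in (1500, 3100):
    print("BIC, N =", n_data, ":", round(delta_information(delta_fit, 1, n_data), 2))
```

Across the quoted range of N the BIC value stays within a few tenths of zero, while the AIC value is -3 regardless of N.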



III. IS THE QUADRUPOLE UNDERPOWERED? 

Much attention has been paid to the low power observed
in the lowest multipoles ($\ell = 2$ in particular), but
how strong is the evidence when shaved with Occam's
razor? This is essentially a problem of variance estimation.
Given a sample and an externally inferred variance
$\sigma_0^2$, when is it worth revising $\sigma_0^2$ in the light of the sample?
Here $\sigma_0^2$ is obtained by appealing to a theory of the
whole spectrum, dependent only on a small number of
parameters (e.g. $\Omega$ and $\Omega_\Lambda$). These are fixed primarily
by the higher multipoles (the Doppler peaks), so as far
as the low multipoles are concerned $\sigma_0^2$ is external.

The "null hypothesis" H0 is that $\sigma_0^2$ is correct, and the
observed low power a fluke. Since the costs of estimating
$\sigma_0^2$ are borne elsewhere, $I(T) = 0$ and $I(D,T) = I(D|T)$.



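The catch is that the fit to the data is far from perfect.
Introducing the "observed variance" of the sample,

$$ \sigma_S^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 , \tag{4} $$

we have

$$ I(D,T) = -\ln P(D|T) = \frac{N}{2} \left[ \ln \sigma_0^2 + \frac{\sigma_S^2}{\sigma_0^2} \right] , \tag{5} $$

far from its minimum.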

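The alternative hypothesis H1 is that the power is indeed
low and that $\sigma_0^2$ should be replaced by an internal
estimate, $\sigma_1^2$, obtained using the sample and bearing its
costs. The procedure for applying HIC can be adapted
from [3] and goes as follows (the only novelty is that here
the average is known). Firstly, we minimize $I(D|T)$, with
solution $\sigma_1^2 = \sigma_S^2$. This cannot be stored to infinite accuracy,
so we expand around the minimum:

$$ I(D|\sigma_S, \Delta\sigma) \approx \frac{N}{2} \left[ \ln \sigma_S^2 + 1 \right] + N \left( \frac{\Delta\sigma}{\sigma_S} \right)^2 . \tag{6} $$

Averaging over a uniform distribution in $\Delta\sigma \in (-\delta\sigma/2, \delta\sigma/2)$
gives $\langle \Delta\sigma^2 \rangle = \delta\sigma^2/12$, so that:

$$ I(D|\sigma_S, \delta\sigma) = \frac{N}{2} \left[ \ln \sigma_S^2 + 1 \right] + \frac{N}{12} \left( \frac{\delta\sigma}{\sigma_S} \right)^2 . \tag{7} $$

The storage penalty, on the other hand, is

$$ I(T) = -\ln \frac{\delta\sigma}{\sigma_S} , \tag{8} $$

so $I(D,T)$ is minimized for optimal accuracy:

$$ \delta\sigma = \sqrt{\frac{6}{N}} \, \sigma_S . \tag{9} $$

Thus the information in the data and H1 is

$$ I(D,T) = \frac{N}{2} \left[ \ln \sigma_S^2 + 1 \right] + \frac{1}{2} + \frac{1}{2} \ln \frac{N}{6} . \tag{10} $$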
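As a quick numerical check of the optimal storage accuracy, the following Python sketch minimizes the part of I(D,T) that depends on the stored precision. It rests on two assumptions read off the derivation above: that the averaged fit cost is (N/12)(δσ/σ_S)^2 and that the storage penalty is -ln(δσ/σ_S). The helper names `cost` and `best_accuracy` are hypothetical, and a plain grid search stands in for an analytic minimization.

```python
import math

def cost(x: float, n: int) -> float:
    """Precision-dependent part of I(D,T), with x = delta_sigma / sigma_S."""
    return n * x**2 / 12.0 - math.log(x)

def best_accuracy(n: int, steps: int = 100_000) -> float:
    """Grid search for the x in (0, 3] that minimizes cost(x, n)."""
    grid = (3.0 * (i + 1) / steps for i in range(steps))
    return min(grid, key=lambda x: cost(x, n))

n = 5  # the 2*ell + 1 = 5 modes of the quadrupole
x_numeric = best_accuracy(n)
x_analytic = math.sqrt(6.0 / n)
print("numeric minimum: ", x_numeric)
print("sqrt(6/N):       ", x_analytic)
print("cost at minimum: ", cost(x_analytic, n))
print("1/2 + 1/2 ln(N/6):", 0.5 + 0.5 * math.log(n / 6.0))
```

Under these assumptions the grid minimum should coincide with sqrt(6/N) to within the grid spacing, and the cost at the minimum equals 1/2 + (1/2) ln(N/6), the N-dependent penalty that appears in the final expression for I(D,T).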



The evidence $H$ against the null hypothesis H0 is the difference
between its information and that in H1 (positive
$H$ favors H1). This may be written as $H = H_f - H_p$,
where $H_f$ is the improvement in the fit

$$ H_f = \frac{N}{2} \left[ \frac{\sigma_S^2}{\sigma_0^2} - 1 - \ln \frac{\sigma_S^2}{\sigma_0^2} \right] \tag{11} $$

(this is often approximated by $-\Delta\chi^2/2$), and $H_p$, the
penalty paid by H1 for introducing a new parameter, is

$$ H_p = \frac{1}{2} + \frac{1}{2} \ln \frac{N}{6} . \tag{12} $$

An exact rendition of this argument (not appealing to
the Taylor expansion (6)) leads to the penalty

$$ H_p = \frac{1}{2} \left[ \ln \frac{N-1}{6} + N \ln \frac{N}{N-1} \right] + \psi(N) \tag{13} $$



TABLE I: Evidence for a low quadrupole, based on various
datasets and Occam's razors $H$, AIC and BIC.

Map      H_f    H      H^AIC   H^BIC
ILC1     2.47   2.11   1.47    1.67
TOH      2.62   2.26   1.62    1.81
DILC     2.08   1.72   1.08    1.27
WMAP3    2.32   1.96   1.32    1.51




where $\psi(N)$ is a small negative correction, monotonic in
$N$, that never exceeds 0.2 in magnitude and is totally
negligible for $N > 5$ (for example $\psi(10) = -0.03$). The
AIC would instead quote $H_p^{AIC} = 1$ (with $H^{AIC} = H_f - H_p^{AIC}$),
whereas the BIC would introduce:

$$ H_p^{BIC} = \frac{1}{2} \ln N \tag{14} $$

(with $H^{BIC} = H_f - H_p^{BIC}$) which in the large $N$ limit is
the same as (12) (or (13)) plus constant 0.4. Generalization
for many independent parameters is straightforward.
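The three penalties can be compared numerically with a short Python sketch. The function names are hypothetical, and the formulas are assumptions read off the text: the Taylor-expanded penalty 1/2 + (1/2) ln(N/6), the exact-form penalty (1/2)[ln((N-1)/6) + N ln(N/(N-1))] with the small correction psi(N) dropped, the AIC penalty of 1, and the BIC penalty (1/2) ln N. The quadrupole has N = 2*2 + 1 = 5 modes, and the H_f values are those quoted in Table I.

```python
import math

def penalty_hic_approx(n: int) -> float:
    """Taylor-expanded penalty: 1/2 + (1/2) ln(N/6)."""
    return 0.5 + 0.5 * math.log(n / 6.0)

def penalty_hic_exact(n: int) -> float:
    """Exact-form penalty with the small psi(N) correction dropped."""
    return 0.5 * (math.log((n - 1) / 6.0) + n * math.log(n / (n - 1)))

def penalty_bic(n: int) -> float:
    """BIC penalty for one extra parameter: (1/2) ln N."""
    return 0.5 * math.log(n)

def evidence(h_fit: float, n: int) -> dict:
    """H = H_f - H_p for one extra parameter and n data points."""
    return {
        "H": h_fit - penalty_hic_exact(n),
        "AIC": h_fit - 1.0,  # AIC charges one unit per parameter
        "BIC": h_fit - penalty_bic(n),
    }

# H_f values as quoted in Table I; N = 5 modes for the quadrupole.
for name, h_fit in [("ILC1", 2.47), ("TOH", 2.62), ("DILC", 2.08), ("WMAP3", 2.32)]:
    row = evidence(h_fit, 5)
    print(name, {key: round(val, 2) for key, val in row.items()})

# Large-N offset between the BIC penalty and the Taylor-expanded penalty.
print(penalty_bic(10_000) - penalty_hic_approx(10_000))
```

With these assumed formulas the printed values should agree with Table I to within rounding of the quoted H_f, and the last line gives the constant offset (1/2) ln 6 - 1/2, about 0.4, between the BIC penalty and the Taylor-expanded one.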

In Table I we examine the evidence for a low
quadrupole. We consider the first year data as in [l3|
(ILC1) and in le| (TOH), as well as the third year release
16J, both the debiased internal linear combination
map (DILC) and the MLE estimate (WMAP3). Clearly
under Occam's razor we can never claim a significant
detection, whatever the dataset. Adding the octupole
and other low $\ell$ does little to improve the situation. Visual
inspection of the plots in U&\ shows that many of
these low-$\ell$ "anomalies" have disappeared in the three
year data. But they were never significant, as the analysis
of the first year data presented in Table II shows.
Naturally $H_f$ improves as more and more multipoles are
considered, but these bring in new parameters and so
the associated "detections" are erased under the weight
of Occam's razor. This table refers to first year TOH; in
other datasets/renditions the evidence is even lower. By
bringing more $\ell$s into the analysis the evidence decreases
further.

None of this will surprise several authors [T^ [13, llfl 
I2EI2II; yet, to drive the point home we stress that the 
evidence for a low quadrupole - bad as it is - is still
stronger than the evidence for a non scale invariant spectrum
under the BIC. Also the message has yet to fully filter
through to enthusiastic theorists (e.g. Q). For example claims
have been made |2^] that DGP gravity |23| fits the
low-$\ell$ spectrum better. While it might be true that the theory
achieves a better fit without introducing new parameters
(and therefore doesn't fall prey to further penalties) the
fact remains that it corrects a misfit that is not significant
to begin with.



IV. THE AXIS OF EVIL 

Many paths lead to the axis of evil. Planarity statis- 
tics JT,], Maxwell multipole vectors [13, 123, and m- 



TABLE II: Evidence (or lack thereof) for low power at small
$\ell$ using the most sympathetic dataset (TOH).

$\ell$ or $\ell$ range   H_f    H       H^AIC   H^BIC
2                        2.62   2.26    1.62    1.81
3                        0.35   -0.18   -0.65   -0.62
2-3                      2.98   2.08    0.97    1.19
4                        1.18   0.51    0.18    0.09
2-4                      4.16   2.59    1.15    1.28




TABLE III: The planarity of the $\ell = 2, 3$ modes using TOH
(top rows) and WMAP3 (bottom). There is nothing anomalous
with the planarity of $\ell = 2$ and $\ell = 3$, taken on their
own. It's the fact that the planarity occurs in roughly the
same direction (and with roughly the same suppression ratio
$\epsilon$) for both multipoles that substantiates the anomaly.

Data    $\ell$s   b    l      $\epsilon$   H_f    H^AIC   H^BIC
TOH     2         58   -103   .030         3.09   0.09    0.68
TOH     3         62   -121   .025         5.06   2.06    2.14
TOH     2-3       61   -113   .032         7.48   4.48    3.76
WMAP3   2         70   -127   .036         2.84   -0.16   0.43
WMAP3   3         62   -122   .035         4.29   1.29    1.37
WMAP3   2-3       64   -123   .038         6.89   3.89    3.16




preference statistics [13 are examples. Here we focus
on the planarity of $\ell = 2, 3$, that is, the fact that in the
frame pointing to $(b,l) \approx (60,-100)$ in Galactic coordinates,
the power is concentrated in the $m = \pm\ell$ modes.
How seriously should we take this?

The more abstract estimation problem is: when is it
justified to split the $a_{\ell m}$ sample into sub-samples with
different variances? This is a variation on the calculation
in the previous section with a subtlety: the result is frame
dependent. Consider a sample with $N$ elements and sample
variance $\sigma_S^2$ (the $2\ell + 1$ modes of a multipole), and
two sub-samples with $N_1$ and $N_2$ elements and sample
variances $\sigma_{S1}^2$ and $\sigma_{S2}^2$ (the planar modes $m = \pm\ell$, and all
the others). The difference in $I(D|T)$ between the null
hypothesis (don't split the sample) and the alternative
hypothesis (split) is

$$ H_f = \ln \frac{\sigma_S^N}{\sigma_{S1}^{N_1} \sigma_{S2}^{N_2}} , \tag{15} $$

where $N = N_1 + N_2$ and $N \sigma_S^2 = N_1 \sigma_{S1}^2 + N_2 \sigma_{S2}^2$. This
depends only on the suppression ratio

$$ \epsilon = \frac{\sigma_{S2}^2}{\sigma_{S1}^2} , \tag{16} $$

and therefore one can consider the issue of planarity even
if the evidence for an internal variance estimate is small or nonexistent.
The value of $\epsilon$ depends on the z-axis coordinates $(b,l)$,
which should be chosen to maximize $H_f$. In the process
we add two more parameters to the Occam's razor bill.
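To make the dependence on the suppression ratio concrete, here is a minimal Python sketch. It assumes that the improvement in fit is H_f = (N/2) ln σ_S^2 - (N_1/2) ln σ_S1^2 - (N_2/2) ln σ_S2^2, that the suppression ratio is the variance of the non-planar modes divided by that of the planar modes, and that each multipole contributes 2ℓ + 1 data points of which 2 are planar. The function names are hypothetical; the ratios are the TOH values quoted in Table III.

```python
import math

def planarity_fit(n1: int, n2: int, eps: float) -> float:
    """H_f for splitting N = n1 + n2 modes, given eps = var2 / var1."""
    n = n1 + n2
    # Only the ratio matters, so set var1 = 1 and var2 = eps;
    # the pooled sample variance is then (n1 + n2 * eps) / n.
    pooled = (n1 + n2 * eps) / n
    return 0.5 * n * math.log(pooled) - 0.5 * n2 * math.log(eps)

def razor(h_fit: float, k: int, n: int) -> tuple:
    """(AIC, BIC) evidence for k extra parameters and n data points."""
    return h_fit - k, h_fit - 0.5 * k * math.log(n)

# Quadrupole: 2 planar modes (m = +-2) and 3 others.
h_quad = planarity_fit(2, 3, 0.030)
# Octupole: 2 planar modes (m = +-3) and 5 others.
h_oct = planarity_fit(2, 5, 0.025)

print("H_f:", round(h_quad, 2), round(h_oct, 2))
print("quadrupole (AIC, BIC):", razor(h_quad, 3, 5))
print("octupole   (AIC, BIC):", razor(h_oct, 3, 7))
```

Because the quoted ratios are rounded to two significant figures, the results can be expected to match the tabulated H_f only to within a few hundredths. Three parameters (two for the axis, one for the ratio) are charged per multipole.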

This is the procedure adopted for analyzing each multipole
independently and in Table III we present results
for two datasets: TOH and the WMAP three year data.
We find that $H_f$ is around 3 for $\ell = 2$ and 5 for $\ell = 3$,
at the cost of introducing 3 parameters (the axis and the
ratio of power $\epsilon$) for each multipole. Using AIC this degrades
$H_f$ to an $H$ around 0 and 2, respectively. Results
for the BIC are reported in the same table. As in previous
studies ^U, ns\ we find no serious evidence for an
anomaly if each multipole is taken on its own. Given a
random, statistically isotropic multipole there is always
a frame in which most of the power is concentrated in a
single $m$; that this $m$ equals $\ell$ is not unlikely for small $\ell$.
What turns the axis of evil into a menace is that the
maximal $H_f$ for $\ell = 2$ and $\ell = 3$ is reached with roughly
the same parameters (see values in Table III). Thus if we
take a single axis and $\epsilon$ chosen so as to maximize the total
$H_{Tf} = H_{Qf} + H_{Of}$, we obtain an $H_{Tf}$ only slightly worse
than the sum of the separate optimal $H_{Qf}$ and $H_{Of}$;
the parameter cost, however, is halved. Our results are
described in Table III. The search for the joint axis was
done numerically, and we see that the result is heavily
weighted by the octupole. The common $\epsilon$ was found via
the method of Lagrange multipliers, i.e. by maximizing


Hxf — H( 



Qf 



Hof - A[crQiO-52 



2 2 1 



(17) 



with solution: 



2 



"SQi 

l±A/2 
IT A/2 



where $i = 1, 2$ indexes the sub-samples and $\lambda$ is the solution
of a quadratic equation expressing $\epsilon_O = \epsilon_Q$ (an equation
that only depends on the sample ratio $\epsilon_{SO}/\epsilon_{SQ}$).

As shown in Table III our evidence for an anomaly is
always above $H = 3$, i.e. "strong evidence". One may
therefore wonder where is the discrepancy with the analysis
in [13? In that work the axis of evil was modeled as
a modulation by an underlying large-scale function, and
a model was found with $H_f = 4$ (a chi-squared improvement
of 8) at a cost of 8 parameters. Using either AIC
or BIC the value of $H$ is therefore negligible. However,
here we exhibited a model improving the fit by about
$H_f = 7$ at a cost of 3 parameters. This (phenomenological)
model is simply based on a diagonal covariance matrix
for $\ell = 2, 3$ of the form:

$$ \langle |a_{\ell m}|^2 \rangle (\mathbf{n}) = c_\ell \left( \delta_{\ell |m|} + \epsilon \left( 1 - \delta_{\ell |m|} \right) \right) \tag{18} $$



Hence the poor evidence reported in [l5j is not a defi- 
ciency of the axis of evil effect or the data, but merely 
a shortcoming of the proposed model itself. One can al- 
ways find a model for any anomaly containing a number 
of parameters so large as to drive H down to a small 
value. But the issue is: what is the value of H for the 
best model of that anomaly, the model with the optimal 
trade off between fit and number of parameters? We have 
gone a fair way toward answering this question. 



V. CONCLUSIONS 

In this paper we subjected to some of Occam's ra- 
zors three patterns that people have claimed to see in 
the CMB data: departures from scale invariance, a low 
quadrupole, and the anisotropy that has come to be 
known as the "axis of evil" . Specifically, we considered 
the razors that we called AIC, BIC and HIC. All three 
agreed to discount the claim for a low quadrupole, while 
in contrast, the two that we brought to bear on the axis 
of evil both suggested that it should be taken seriously. 
Only in relation to scale-invariance was there disagree- 
ment, with AIC tending to accept the claim and BIC 
definitely rejecting it. (We did not consult HIC in con- 
nection with the first and third effects, but we plan to do 
so in a later version of this preprint.) 

It is somewhat embarrassing that Occam razors can 
disagree, but a glance at equations (2) and (3) reveals 
that this is inevitable, since the penalty terms $k$ and
$k \ln \sqrt{N}$ are very different when the number of data points
$N$ is $\gg 1$. By comparing these two expressions, one
sees that BIC will be more lenient than AIC when N is 
small, but much tougher when N is big (the crossover 
coming around N = 7). For HIC, it is harder to make 
a blanket statement, but experience has shown that it 
tends to agree more closely with BIC, probably since each 
relies, in its own way, on a version of Bayes' rule. 
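The crossover quoted above can be checked by equating the two penalties, assuming the AIC penalty is $k$ and the BIC penalty is $(k/2) \ln N$ for $k$ extra parameters:

```latex
\frac{k}{2} \ln N = k
\quad\Longrightarrow\quad \ln N = 2
\quad\Longrightarrow\quad N = e^{2} \approx 7.4 .
```

For example, with one extra parameter the BIC penalty is about 0.80 at $N = 5$ (below the AIC penalty of 1) and about 3.7 at $N = 1500$ (well above it).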

In the case of the claimed departure from scale- 
invariance, we would thus expect HIC to agree with BIC 
in favoring a negative verdict, which at the very least 
should be added as a word of caution to the conclu- 
sions reported in [l^. By way of comparison, it's worth 
pointing out that, even if we accept the more favorable 
value of H coming from AIC, the evidence for scale non- 
invariance is no better than that for the "axis of evil" . 
When all razors agree on a lack of evidence, as is the 
case with the underpowered quadrupole, one should defi- 
nitely not lose sleep over the anomaly, and we hope keen 
theorists will divert their creativity elsewhere. 

But even when different razors agree on an anomaly - 
such as the axis of evil - one should not trust the result 
blindly. The issue of systematics remains of paramount 
importance, as shown by the significant differences in 
H obtained from the various datasets and methodolo- 
gies used to deal with the galactic foregrounds. And 
one should bear in mind that even the most enthusias- 
tic "Ockhamist" would be unlikely to claim for his or 
her favorite razor a freedom from ambiguity |2y| better 
than $\Delta H = \pm 0.3$ or so. In addition it's probably fair
to say that the trouble of rewriting cosmology textbooks 
deserves in itself a penalty factor. This is hard to evalu- 
ate but it may translate into the requirement of a higher 
level of evidence than "strong" , at the phenomenological 
level. Perhaps the ever improving polarization maps will 
have a say on the matter and tilt the scales. This issue 
is currently being very actively investigated.